Abstract

Visual categorization problems, such as object classification or action recognition,
are increasingly often approached using a detection strategy: a classifier function
is first applied to candidate subwindows of the image or the video, and then the
maximum classifier score is used for class decision. Traditionally, the subwindow
classifiers are trained on a large collection of examples manually annotated with
masks or bounding boxes. The reliance on time-consuming human labeling
effectively limits the application of these methods to problems involving very few
categories. Furthermore, the human selection of the masks introduces arbitrary
biases (e.g. in terms of window size and location) which may be suboptimal for
classification.

In this report we propose a novel method for learning a discriminative subwindow
classifier from examples annotated with binary labels indicating the presence
of an object or action of interest, but not its location. During training, our
approach simultaneously localizes the instances of the positive class and learns a
subwindow SVM to recognize them. We extend our method to classification of
time series by presenting an algorithm that localizes the most discriminative set
of temporal segments in the signal. We evaluate our approach on several datasets
for object and action recognition and show that it achieves results similar and in
many cases superior to those obtained with full supervision.
Figure 1: A unified framework for image categorization and time series
classification from weakly labeled data. Our method simultaneously localizes the regions
of interest in the examples and learns a region-based classifier, thus building
robustness to background and uninformative signal.

1  Introduction 

Object categorization systems aim at recognizing the classes of the objects present
in an image, independently of the background. Early computer vision methods for
object categorization attempted to build robustness to background clutter by using
image segmentation as preprocessing. It was hoped that segmentation methods
could partition images into their high-level constituent parts, and categorization
could then be simply carried out as recognition of the object classes corresponding
to the segments. This naive strategy for categorization floundered on the
challenges presented by bottom-up image segmentation. The difficulty of partitioning
an image into objects purely based on low-level cues is now well understood, and
it has led in recent years to a flourishing of methods where bottom-up segmentation
is assisted by concurrent top-down recognition [31, 17, 4, 27]. However,
the application of these methods has been limited in practice by a) the challenges
posed by the acquisition of detailed ground truth segmentations needed to train
these systems, and b) the high computational complexity of semantic segmentation,
which requires solving the classification problem at the pixel level.

An efficient alternative is provided by object detection methods, which can perform
object localization without requiring pixel-level segmentation. Object detection
algorithms operate by evaluating a classifier function at many different subwindows
of the image and then predicting the presence of the object in subwindows with
high scores. This methodology has been applied with great success to a wide variety
of object classes [29, 8, 7]. Recent work [15] has shown that efficient computation
of classification maxima over all possible subwindows of an image is even
possible for highly sophisticated classifiers, such as SVMs with spatial pyramid
kernels.

Although great advances have been made in terms of reducing the computational
complexity of object detection algorithms, their accuracy has remained
dependent on the amount of human-annotated data available to train them.
Subwindows (or bounding boxes) are obviously less time-consuming to collect than
detailed segmentations. However, the dependence on human work for training
inevitably limits the scalability of these methods. Furthermore, not only the amount
of ground truth data but also the characteristics of the human selections may
affect the detection. For example, it has been shown [8] that the specific size and
location of the selections may have a significant impact on performance. In some
cases, including a margin around the bounding box of the training selections will
lead to better detection because of statistical correlation between the appearance
of the region surrounding the object (often referred to as the "spatial context") and
the category of the object (e.g. cars tend to appear on roads). However, it is rather
difficult to tune the amount of context to include for optimal classification.

The problem is even more acute for the case of categorization of time series data.
Consider the task of automatically monitoring the behavior of an animal based on its
body movement. It is safe to believe that the intrinsic differences between the
distinct animal activities (e.g. drinking, exploring, etc.) do not appear continuously
in the examples but are rather associated with specific movement patterns (e.g. the
turning of the head, a short fast-paced walk, etc.) possibly occurring multiple times
in the sequences. Thus, as for the case of object categorization, classification
based on comparisons of the whole signals is unlikely to yield good performance.
However, if we asked a person to localize the most discriminative patterns in such
sequences, we would obtain highly subjective annotations, unlikely to be optimal
for the training of a classifier.

In this report we propose a novel learning framework that simultaneously
localizes the most discriminative subwindows in the data and learns a classifier to
distinguish them. Our algorithm requires only the class labels as annotation for
the training examples, and thus eliminates the high cost and arbitrariness of human
ground truth selections. In the case of object categorization, our method optimizes
an SVM classification objective with respect to both the classifier parameters and
the subwindows containing the object of interest in the positive image examples.
In the case of classification of time series, we relax the subwindow contiguity
constraint in order to discover discriminative patterns which may occur
discontinuously over the observation period. Specifically, we allow the discriminative
patterns to occur in at most k disjoint time-intervals, where k is a problem-dependent
tunable parameter of our system. The algorithm solves for the locations and
durations of these intervals while learning the SVM classifier. We demonstrate our
approach on several object and activity recognition datasets and show that our
weakly-supervised classifiers consistently match and often surpass the accuracy
of SVMs trained under full supervision.


2  Related  Work 

Most prior work on weakly supervised object localization and classification is
based on the use of region or part-based generative models. Fergus et al. [12]
represent objects as flexible constellations of parts by learning probabilistic
models of both the appearance and the mutual position of the parts. Parts are
selected from points found by a feature detector. Classification of a test image
is performed in a Bayesian fashion by evaluating the detected features using the
learned model. The performance of this system rests completely on the ability of
the feature detector to fire consistently at points corresponding to the learned parts
of the model. Russell et al. [23] instead propose a fully unsupervised algorithm
to discover objects and associated segments from a large collection of images.
Multiple segmentations are computed from each image by varying the parameters
of a segmentation method. The key assumption is that each object instance is
correctly segmented at least once and that the features of correct segments form
object-specific coherent clusters discoverable using latent topic models from text
analysis. Although the algorithm is shown to be able to discover many different
types of objects, its effectiveness as a categorization technique is unclear. Cao
and Fei-Fei [5] further extend the latent topic model by assuming that a single
topic model is responsible for generating the image patches within each region
of the image, thus enforcing spatial coherence within each segment. Todorovic
and Ahuja [26] describe a system that learns tree-based representations of
multiscale image segmentations via a subtree matching algorithm.

A multitude of algorithms based on Multiple Instance Learning (MIL) have been
recently proposed for training object classifiers with weakly supervised data
(see [19, 30, 2, 6] for a sampling of these techniques). Most of these methods view
images as bags of segments, traditionally computed using bottom-up segmentation or
fixed partitioning of the image into blocks. Then MIL trains a discriminative binary
classifier predicting the class of segments, under the assumption that each positive
training image contains at least one true-positive segment (corresponding to the
object of interest), while negative training images contain none. However, these
approaches run into the same problem faced by the early segmentation-based
recognition systems: segmentation from low-level cues is often unable to provide
semantically correct segments. Galleguillos et al. [13] attempt to circumvent this
problem by providing multiple segmentations to the MIL learning algorithm in the
hope that one of them is correct. The approach we propose does not rely on unreliable
segmentation methods as preprocessing. Instead, it performs localization while
training the classifier.

Our work can also be viewed as an extension of feature selection
methods, in which different features are selected for each example. The idea of
joint feature selection and classifier optimization has been proposed before, but
always in combination with strongly labeled data. Schweitzer [24] proposes a
linear time algorithm to select jointly a subset of pixels and a set of eigenvectors
that minimize the Rayleigh quotient in Linear Discriminant Analysis. Nguyen et
al. [20] propose a convex formulation to simultaneously select the most
discriminative pixels and optimize the SVM parameters. However, both aforementioned
methods require the training data to be well aligned and the same set of pixels is
selected for every image. Felzenszwalb et al. [11] describe Latent SVM, a
powerful classification framework based on a deformable part model. However,
this method also requires knowing the bounding boxes of foreground objects during
training. Finally, Blaschko and Lampert [3] use supervised structured learning to
improve the localization accuracy of SVMs.

The literature on weakly supervised or unsupervised localization and
categorization applied to time series is fairly limited compared to the object recognition
case. Zhong et al. [32] detect unusual activities in videos by clustering
equal-length segments extracted from the video. The segments falling in isolated
clusters are classified as abnormal activities. Fanti et al. [10] describe a system for
unsupervised human motion recognition from videos. Appearance and motion
cues derived from feature tracking are used to learn graphical models of actions
based on triangulated graphs. Niebles et al. [21] tackle the same problem but
represent each video as a bag of video words, i.e. quantized descriptors computed
at spatial-temporal interest points. An EM algorithm for topic models is then
applied to discover the latent topics corresponding to the distinct actions in the
dataset. Localization is obtained by computing the MAP topic of each word.


3  Localization-classification  SVM 

In  this  section  we  first  propose  an  algorithm  to  simultaneously  localize  objects 
of  interest  and  train  an  SVM.  We  then  extend  it  to  classification  of  time  series 
data  by  presenting  an  efficient  algorithm  to  identify  in  the  signal  an  optimal  set  of 
discriminative  segments,  which  are  not  constrained  to  be  contiguous. 

3.1  The  learning  objective 

Assume we are given a set of positive training images $\{d_i^+\}$ and a set of negative
training images $\{d_i^-\}$ corresponding to weakly labeled data with labels indicating
for each example the presence or absence of an object of interest. Let $\mathcal{LS}(d)$
denote the set of all possible subwindows of image $d$. Given a subwindow
$x \in \mathcal{LS}(d)$, let $\varphi(x)$ be the feature vector computed from the image subwindow. We
learn an SVM for joint localization and classification by solving the following
constrained optimization:

$$\underset{w, b}{\text{minimize}} \quad \frac{1}{2}\|w\|^2, \tag{1}$$

$$\text{s.t.} \quad \max_{x \in \mathcal{LS}(d_i^+)} \{w^T \varphi(x) + b\} \geq 1 \quad \forall i, \tag{2}$$

$$\max_{x \in \mathcal{LS}(d_i^-)} \{w^T \varphi(x) + b\} \leq -1 \quad \forall i. \tag{3}$$

The constraints appearing in this objective state that each positive image must
contain at least one subwindow classified as positive, and that all subwindows in
each negative image must be classified as negative. The goal is then to maximize
the margin subject to these constraints. By optimizing this problem we obtain an
SVM, i.e. parameters $(w, b)$, that can be used for localization and classification.
Given a new testing image $d$, localization and classification are done as follows.
First, we find the subwindow $\hat{x}$ yielding the maximum SVM score:

$$\hat{x} = \arg\max_{x \in \mathcal{LS}(d)} w^T \varphi(x). \tag{4}$$

If the value of $w^T \varphi(\hat{x}) + b$ is positive, we report $\hat{x}$ as the detected object for the
test image. Otherwise, we report no detection.
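The test-time rule above can be illustrated with a minimal Python sketch. It assumes the candidate subwindows have already been enumerated and turned into feature vectors (plain lists of numbers); the function name `localize_and_classify` and this list-based representation are hypothetical choices for illustration, not part of the original method description.

```python
def localize_and_classify(features, w, b):
    """Return (index of best subwindow or None, decision value).

    features: list of feature vectors, one per candidate subwindow (assumed given).
    w, b:     parameters of a trained linear SVM.
    """
    # SVM score w^T phi(x) for every candidate subwindow.
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    # Subwindow with the maximum score, as in Eq. (4).
    best = max(range(len(scores)), key=scores.__getitem__)
    decision = scores[best] + b
    # Report a detection only if the decision value is positive.
    return (best, decision) if decision > 0 else (None, decision)
```

In this sketch the exhaustive enumeration stands in for whatever subwindow search procedure is actually used.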

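As in the traditional formulation of SVM, the constraints are allowed to be
violated by introducing slack variables:

$$\underset{w, b, \{\alpha_i\}, \{\beta_i\}}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + C \sum_i \alpha_i + C \sum_i \beta_i, \tag{5}$$

$$\text{s.t.} \quad \max_{x \in \mathcal{LS}(d_i^+)} \{w^T \varphi(x) + b\} \geq 1 - \alpha_i \quad \forall i, \tag{6}$$

$$\max_{x \in \mathcal{LS}(d_i^-)} \{w^T \varphi(x) + b\} \leq -1 + \beta_i \quad \forall i, \tag{7}$$

$$\alpha_i \geq 0, \quad \beta_i \geq 0 \quad \forall i.$$

Here, $C$ is the parameter controlling the trade-off between having a large margin
and less constraint violation.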
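To make the role of the slack variables concrete, the following Python sketch evaluates the soft-margin objective for fixed parameters $(w, b)$. For fixed parameters, the smallest slacks satisfying the constraints are hinge-type quantities computed from the best-scoring subwindow of each image. The representation of each image as a "bag" (a list of subwindow feature vectors) and the function names are assumptions made for illustration only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def soft_margin_objective(w, b, pos_bags, neg_bags, C):
    """Objective value and minimal slacks for fixed (w, b).

    pos_bags / neg_bags: one bag per image; each bag is a list of
    subwindow feature vectors (hypothetical, pre-enumerated).
    """
    def best_score(bag):
        # Maximum SVM decision value over the subwindows of one image.
        return max(dot(w, f) for f in bag) + b

    # Positive images: slack needed for the best subwindow to reach +1.
    alphas = [max(0.0, 1.0 - best_score(bag)) for bag in pos_bags]
    # Negative images: slack needed for the best subwindow to stay below -1.
    betas = [max(0.0, 1.0 + best_score(bag)) for bag in neg_bags]
    value = 0.5 * dot(w, w) + C * (sum(alphas) + sum(betas))
    return value, alphas, betas
```

A positive image incurs no penalty as soon as one of its subwindows scores at least 1, whereas a negative image is penalized by its highest-scoring subwindow.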

3.2  Optimization 

Our objective is in general non-convex. We propose optimization via a coordinate
descent approach that alternates between optimizing the objective w.r.t. parameters
$(w, b, \{\alpha_i\}, \{\beta_i\})$ and finding the subwindows of images $\{d_i^+\} \cup \{d_i^-\}$ that
maximize the SVM scores. However, since the cardinality of the sets of all possible
subwindows may be very large, special treatment is required for constraints
of type (7). We use constraint generation to handle these constraints: $\mathcal{LS}(d_i^-)$
is iteratively updated by adding the most violated constraint at every step.
Although constraint generation has exponential running time in the worst case, it
often works well in practice.
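One step of constraint generation for a single negative image might look like the sketch below. It assumes the subwindows of the image are available as a list of feature vectors and that the working set stores indices into that list; the function name, the tolerance `tol`, and this bookkeeping are hypothetical and only illustrate the idea of adding the most violated constraint.

```python
def add_most_violated(working_set, bag, w, b, beta, tol=1e-6):
    """Add the most violated negative-image constraint to working_set.

    working_set: set of subwindow indices whose constraints are enforced.
    bag:         feature vectors of all subwindows of one negative image.
    beta:        current slack value for this image.
    Returns True if a new constraint was added.
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) + b for f in bag]
    # The highest-scoring subwindow violates the negative constraint the most.
    j = max(range(len(bag)), key=scores.__getitem__)
    if scores[j] > -1.0 + beta + tol and j not in working_set:
        working_set.add(j)
        return True
    return False
```

In an outer loop, the classifier would be re-optimized over the current working sets and this step repeated until no new constraint is added.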

The above optimization requires localizing, at each iteration, the subwindow
maximizing the SVM score in each image. Thus, we need a very fast localization
procedure. For this purpose, we adopt the representation and algorithm described
in [15]. Images are represented as bags of visual words obtained by quantizing
SIFT descriptors [18] computed at random locations and scales. For quantization,
we use a visual dictionary built by applying K-means clustering to a set of
descriptors extracted from the training images [25]. The set of possible subwindows
for an image is taken to be the set of axis-aligned rectangles. The feature vector
$\varphi(x)$ is the histogram of visual words associated with descriptors inside
rectangle $x$. Lampert et al. [15] showed that, when using this image representation, the
search for the rectangle maximizing the SVM score can be executed efficiently by
means of a branch-and-bound algorithm.
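With a histogram of visual words as the feature vector, the score of a rectangle is simply the sum of the SVM weights of the words falling inside it. The Python sketch below exploits this additivity with a 2D prefix-sum table and an exhaustive search over rectangles. It is a brute-force stand-in for illustration on small grids, not the branch-and-bound search of [15]; the function name, the `(row, col, word_id)` point format, and the optional size limits are assumptions.

```python
def best_rectangle(points, w, height, width, max_h=None, max_w=None):
    """Exhaustively find the axis-aligned rectangle with maximum score.

    points: list of (row, col, word_id) for each quantized descriptor.
    w:      SVM weight per visual word.
    Returns (score, (top, left, bottom, right)) with inclusive bounds.
    """
    max_h = height if max_h is None else max_h
    max_w = width if max_w is None else max_w
    # Per-pixel sum of the weights of the words located at that pixel.
    weight = [[0.0] * width for _ in range(height)]
    for r, c, word in points:
        weight[r][c] += w[word]
    # P[r][c] = total weight of pixels with row < r and col < c.
    P = [[0.0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        for c in range(width):
            P[r + 1][c + 1] = weight[r][c] + P[r][c + 1] + P[r + 1][c] - P[r][c]
    best = (float("-inf"), None)
    for top in range(height):
        for bottom in range(top, min(height, top + max_h)):
            for left in range(width):
                for right in range(left, min(width, left + max_w)):
                    s = (P[bottom + 1][right + 1] - P[top][right + 1]
                         - P[bottom + 1][left] + P[top][left])
                    if s > best[0]:
                        best = (s, (top, left, bottom, right))
    return best
```

The exhaustive search is quartic in the image side length, which is why an efficient search procedure is needed for real images.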

3.3  Extension  to  time  series 

As  in  the  case  of  image  categorization,  even  for  time  series  the  global  statistics 
computed  from  the  entire  signal  may  yield  suboptimal  classification.  For  example, 
the  differences  between  two  classes  of  temporal  signals  may  not  be  visible  over 
the  entire  observation  period.  However,  unlike  in  the  case  of  images  where  objects 
often  appear  as  fully-connected  regions,  the  patterns  of  interest  in  temporal  signals 
may  not  be  contiguous.  This  raises  a  technical  challenge  when  extending  the 
learning  formulation  of  Eq.  (5)  to  time  series  classification:  how  to  efficiently 
search  for  sets  of  non-contiguous  discriminative  segments?  In  this  section  we 
describe  a  representation  of  temporal  signals  and  a  novel  efficient  algorithm  to 
address  this  challenge. 

3.3.1  Representation  of  time  series 

Time series can be represented by descriptors computed at spatial-temporal
interest points [16, 9, 21]. As in the case of images, sample descriptors from training
data can be clustered to create a visual-temporal vocabulary [9]. Subsequently,
each descriptor is represented by the ID of the corresponding vocabulary entry
and the frame number at which the point is detected. In this work, we define a
$k$-segmentation of a time series as a set of $k$ disjoint time-intervals, where $k$ is
a tunable parameter of the algorithm. Note that it is possible for some intervals
of a $k$-segmentation to be empty. Given a $k$-segmentation $x$, let $\varphi(x)$ denote the
histogram of visual-temporal words associated with interest points in $x$. Let $C_i$
denote the set of words occurring at frame $i$. Let $a_i = \sum_{c \in C_i} w_c$ if $C_i$ is
non-empty, and $a_i = 0$ otherwise. $a_i$ is the weighted sum of words occurring in frame
$i$ where word $c$ is weighted by SVM weight $w_c$. From these definitions it follows
that $w^T \varphi(x) = \sum_{i \in x} a_i$. For fast localization of discriminative patterns in time
series we need an algorithm to efficiently find the $k$-segmentation maximizing the
SVM score $w^T \varphi(x)$. Indeed, this optimization can be solved globally in a very
efficient way. The following section describes the algorithm. In the appendix, we
prove the optimality of the solution produced by this algorithm.
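The identity between the histogram score and the sum of per-frame values can be checked numerically with a short Python sketch. It assumes each frame is given as a list of word IDs (a word occurring twice in a frame is listed twice and counted twice in the histogram) and uses 0-based, inclusive frame intervals; these conventions and the function names are illustrative assumptions.

```python
from collections import Counter


def frame_scores(words_per_frame, w):
    """Per-frame value: sum of SVM weights of the words in the frame (0 if empty)."""
    return [sum(w[c] for c in words) for words in words_per_frame]


def histogram_score(words_per_frame, w, segmentation):
    """w^T phi(x): dot product of w with the word histogram of the segmentation."""
    hist = Counter()
    for lo, hi in segmentation:          # inclusive frame intervals
        for i in range(lo, hi + 1):
            hist.update(words_per_frame[i])
    return sum(w[c] * count for c, count in hist.items())


def segment_sum(a, segmentation):
    """Sum of per-frame values over all frames covered by the segmentation."""
    return sum(sum(a[lo:hi + 1]) for lo, hi in segmentation)
```

Both routes give the same number, so searching for the best segmentation reduces to a problem on the one-dimensional sequence of per-frame values.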
3.3.2 An efficient localization algorithm

Let $n$ be the length of the time signal and $\mathcal{I} = \{[l, u] : 1 \leq l \leq u \leq n\}$ be the
set of all subintervals of $[1, n]$. For a subset $S \subseteq \{1, \cdots, n\}$, let $f(S) = \sum_{i \in S} a_i$.
Maximization of $w^T \varphi(x)$ is equivalent to:

$$\underset{I_1, \cdots, I_k}{\text{maximize}} \ \sum_{j=1}^{k} f(I_j) \quad \text{s.t.} \quad I_i \in \mathcal{I} \ \ \&\ \ I_i \cap I_j = \emptyset \ \ \forall i \neq j. \tag{8}$$

This problem can be optimized very efficiently using Algo. 1 presented below.

Algorithm 1 Find best k disjoint intervals that optimize (8)

Input: $a_1, \cdots, a_n$, $k \geq 1$.
Output: a set $X^k$ of best $k$ disjoint intervals.
1: $X^0 := \emptyset$.
2: for $m = 0$ to $k - 1$ do
3:     $J_1 := \arg\max_{J \in \mathcal{I}} f(J)$ s.t. $J \cap S = \emptyset \ \forall S \in X^m$.
4:     $J_2 := \arg\max_{J \in \mathcal{I}} -f(J)$ s.t. $J \subseteq S \in X^m$.
5:     if $f(J_1) \geq -f(J_2)$ then
6:         $X^{m+1} := X^m \cup \{J_1\}$
7:     else
8:         Let $S \in X^m : J_2 \subseteq S$. $S$ is divided into three disjoint intervals: $S = S^- \cup J_2 \cup S^+$.
9:         $X^{m+1} := (X^m - \{S\}) \cup \{S^-, S^+\}$


This algorithm progressively finds the set of $m$ intervals (possibly empty) that
maximize (8) for $m = 1, \cdots, k$. Given the optimal set of $m$ intervals, the optimal
set of $m + 1$ intervals is obtained as follows. First, find the interval $J_1$ that has
maximum score $f(J_1)$ among the intervals that do not overlap with any currently
selected interval (line 3). Second, locate $J_2$, the worst subinterval of all currently
selected intervals, i.e. the subinterval with lowest score $f(J_2)$ (line 4). Finally, the
optimal set of $m + 1$ intervals is constructed by executing either of the following
two operations, depending on which one leads to the higher objective:

1. Add $J_1$ to the optimal set of $m$ intervals (line 6);

2. Break the interval of which $J_2$ is a subinterval into three intervals and
remove $J_2$ (line 9).
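The add-or-split procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses 0-based inclusive indices, a Kadane-style scan to find the best and worst subintervals, and hypothetical function names. It also recomputes all candidates at every iteration instead of caching them, so it is simpler but slower than a cached version would be.

```python
def best_signed_subinterval(a, lo, hi, sign):
    """Scan a[lo..hi] for the subinterval maximizing sign * (sum of elements).

    Returns (value, start, end); the empty interval is allowed and is
    reported as (0.0, None, None).
    """
    best, best_start, best_end = 0.0, None, None
    run, run_start = 0.0, lo
    for i in range(lo, hi + 1):
        if run <= 0:                      # restart when the running sum stops helping
            run, run_start = 0.0, i
        run += sign * a[i]
        if run > best:
            best, best_start, best_end = run, run_start, i
    return best, best_start, best_end


def best_k_intervals(a, k):
    """Return at most k disjoint intervals (start, end) with maximum total sum."""
    n, selected = len(a), []              # selected: sorted, disjoint intervals
    empty = (0.0, None, None)
    for _ in range(k):
        # Complement of the selected intervals inside [0, n-1].
        gaps, cursor = [], 0
        for s, e in selected:
            if cursor < s:
                gaps.append((cursor, s - 1))
            cursor = e + 1
        if cursor < n:
            gaps.append((cursor, n - 1))
        # Line 3: best interval that does not overlap the current selection.
        add = max((best_signed_subinterval(a, lo, hi, +1) for lo, hi in gaps),
                  key=lambda t: t[0], default=empty)
        # Line 4: worst subinterval inside a currently selected interval.
        cut = max((best_signed_subinterval(a, s, e, -1) for s, e in selected),
                  key=lambda t: t[0], default=empty)
        if add[0] >= cut[0]:
            if add[1] is None:            # nothing left that can raise the objective
                break
            selected.append((add[1], add[2]))
        else:
            _, cs, ce = cut
            s, e = next((s, e) for s, e in selected if s <= cs and ce <= e)
            selected.remove((s, e))       # split S into S^- and S^+, dropping J2
            if s < cs:
                selected.append((s, cs - 1))
            if ce < e:
                selected.append((ce + 1, e))
        selected.sort()
    return selected
```

For example, `best_k_intervals([2, -1, 2, -5, 3], 3)` first selects the interval covering the first three elements, then adds the last element, and finally splits the first interval around the `-1`, returning `[(0, 0), (2, 2), (4, 4)]`.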
Algo. 1 assumes $J_1$ and $J_2$ can be found efficiently. This is indeed the case.
We now describe the procedure for finding $J_1$. The procedure for finding $J_2$ is
similar.

Let $\bar{X}^m$ denote the relative complement of $X^m$ in $[1, n]$, i.e. $\bar{X}^m$ is the set of
intervals such that the "union" of the intervals in $X^m$ and $\bar{X}^m$ is the interval $[1, n]$.
Since $X^m$ has at most $m$ elements, $\bar{X}^m$ has at most $m + 1$ elements. Since $J_1$
does not intersect with any interval in $X^m$, it must be a subinterval of an interval
of $\bar{X}^m$. Thus, we can find $J_1$ as $J_1 = \arg\max_{S \in \bar{X}^m} f(J_S)$ where:

$$J_S = \arg\max_{J \subseteq S} f(J). \tag{9}$$

Eq. (9) is a basic operation that needs to be performed repeatedly: finding
a subinterval of an interval that maximizes the sum of elements in that subinterval.
This operation can be performed by Algo. 2 below with running time complexity
$O(n)$. Note that the result of executing (9) can be cached; we do not need to
recompute $J_S$ for many $S$ at each iteration of Algo. 1. Thus the total running
complexity of Algo. 1 is $O(nk)$. Algo. 1 is guaranteed to produce a globally optimal
solution for (8) (see the appendix).

Algorithm 2 Find the best subinterval

Input: $a_1, \cdots, a_n$, an interval $[l, u] \subseteq [1, n]$.
Output: $[sl, su] \subseteq [l, u]$ with maximum sum of elements.
1: $b_0 := 0$.
2: for $m = 1$ to $n$ do
3:     $b_m := b_{m-1} + a_m$. //compute integral image
4: $[sl, su] := [0, 0]$; $val := 0$. //empty subinterval
5: $\hat{m} := l - 1$. //index for minimum element so far
6: for $m = l$ to $u$ do
7:     if $b_m - b_{\hat{m}} > val$ then
8:         $[sl, su] := [\hat{m} + 1, m]$; $val := b_m - b_{\hat{m}}$
9:     else if $b_m < b_{\hat{m}}$ then
10:        $\hat{m} := m$. //keep track of the minimum element
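The best-subinterval routine (Algo. 2) translates almost line by line into Python. The sketch below keeps the 1-based interval convention of the pseudocode, with `a[0]` holding $a_1$; the function name and the tuple return format are illustrative assumptions.

```python
def best_subinterval(a, l, u):
    """Maximum-sum subinterval of [l, u] (1-based, inclusive).

    Returns (sl, su, val); (0, 0, 0.0) denotes the empty subinterval.
    """
    n = len(a)
    # Prefix sums ("integral image"): b[m] = a_1 + ... + a_m.
    b = [0.0] * (n + 1)
    for m in range(1, n + 1):
        b[m] = b[m - 1] + a[m - 1]
    sl, su, val = 0, 0, 0.0               # start with the empty subinterval
    m_hat = l - 1                          # index of the minimum prefix sum so far
    for m in range(l, u + 1):
        if b[m] - b[m_hat] > val:
            sl, su, val = m_hat + 1, m, b[m] - b[m_hat]
        elif b[m] < b[m_hat]:
            m_hat = m                      # keep track of the minimum element
    return sl, su, val
```

In this sketch the prefix sums are recomputed on every call for clarity; computing them once and reusing them across calls would avoid the repeated pass over the whole signal.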

4  Experiments 

This  section  describes  experiments  on  several  datasets  for  object  categorization 
and  time  series  classification. 
Figure 2: Examples taken from (a) the CMU Face Images and (b) the street scene
dataset.

4.1  Object  localization  and  categorization 

4.1.1  Experiments  on  car  and  face  datasets 

This subsection presents evaluations on two image collections. The first
experiment was performed on CMU Face Images, a publicly available dataset from
the UCI machine learning repository¹. This database contains 624 face images
of 20 people with different expressions and poses. The subjects wear sunglasses
in roughly half of the images. Our classification task is to distinguish between
the faces with sunglasses and the faces without sunglasses. Some image
examples from the database are given in Fig. 2(a). We divided this image collection
into disjoint training and testing subsets. Images of the first 8 people are used for
training while images of the last 12 people are reserved for testing. Altogether,
we had 254 training images (126 with glasses and 128 without glasses) and 370
testing images (185 examples for both the positive and the negative class).

The second experiment was performed on a dataset collected by us. Our
collection contains 400 images of street scenes. Half of the images contain cars and
half of them do not. This is a challenging dataset because the appearance of the
cars in the images varies in shape, size, grayscale intensity, and location.
Furthermore, the cars occupy only a small portion of the images and may be partially
occluded by other objects. Some examples of images from this dataset are shown
in Fig. 2(b). Given the limited number of examples available, we applied 4-fold
cross validation to obtain an estimate of the performance.

¹ http://archive.ics.uci.edu/ml/datasets/CMU+Face+Images

Each image is represented by a set of 10,000 local SIFT descriptors [18]
selected at random locations and scales. The descriptors are quantized using a
dictionary of 1,000 visual words obtained by applying hierarchical K-means [22] to
100,000 training descriptors.

In order to speed up the learning, an upper constraint on the rectangle size is
imposed. In the first experiment, as the image size is 120 × 128 and the sizes
of sunglasses are relatively small, we restrict the height and width of permissible
rectangles to not exceed 30 and 50 pixels, respectively. Similarly, for the second
experiment, we constrain permissible rectangles to have height and width no larger
than 300 and 500 pixels, respectively (cf. image size of 600 × 800).

We compared our approach to several competing methods. SVM denotes a
traditional SVM approach in which each image is represented by the histogram of
the words in the whole image. BoW is the bag-of-words method [22] in the
implementation of [28]. It uses a 10-nearest neighbor classifier. We also benchmark our
method against SVM-FS [15], a fully supervised method requiring ground truth
subwindows during training (FS stands for fully supervised). SVM-FS trains an
SVM using ground truth bounding boxes as positive examples and ten random
rectangles from each negative image for negative data.
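Drawing random rectangles from a negative image, as done for the SVM-FS baseline, can be sketched as below. The text does not specify the sampling distribution or whether the size limits on permissible rectangles also apply to these random negatives; the uniform sampling, the size limits, and the function name here are assumptions made for illustration.

```python
import random


def sample_rectangles(height, width, max_h, max_w, count, rng):
    """Sample `count` axis-aligned rectangles (top, left, bottom, right), inclusive.

    Heights and widths are drawn uniformly up to the given limits, and the
    position is drawn uniformly among placements that fit in the image.
    """
    rects = []
    for _ in range(count):
        h = rng.randint(1, min(max_h, height))
        w = rng.randint(1, min(max_w, width))
        top = rng.randint(0, height - h)
        left = rng.randint(0, width - w)
        rects.append((top, left, top + h - 1, left + w - 1))
    return rects


# Ten rectangles of at most 30 x 50 pixels in a 120 x 128 image.
negatives = sample_rectangles(120, 128, 30, 50, 10, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.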

Tab. 1 shows the classification performance measured using both the
accuracy rates and the areas under the ROCs. Note that our approach outperforms not
only SVM and BoW (which are based on global statistics), but also SVM-FS,
which is a fully supervised method requiring the bounding boxes of the objects
during training. This suggests that the boxes tightly enclosing the objects of
interest are not always the most discriminative regions. Our method automatically
localizes the subwindows that are most discriminative for classification. Fig. 3(a)
shows discriminative detection on a few face testing examples. Sunglasses are
the distinguishing elements between positive and negative classes. Our algorithm
successfully discovers such regions and exploits them to improve the
classification performance. Fig. 3(b) shows some examples of car localization. Parts of
the road below the cars tend to be included in the detection output. This suggests
that the appearance of roads is a contextual indication for the presence of cars.
Fig. 4 displays several difficult cases where our method does not provide good
localization of the objects.
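For reference, the two performance measures used here can be computed from classifier decision values as in the following Python sketch. It assumes labels of +1 for positive and any other value for negative, and computes the ROC area through its rank-statistic interpretation (the fraction of positive-negative pairs ranked correctly, counting ties as one half). The function names are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def accuracy(scores, labels, threshold=0.0):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(scores)


def roc_area(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    # Count positive-negative pairs ranked correctly; ties count one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Unlike the accuracy rate, the ROC area does not depend on the choice of decision threshold.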

SVM, SVM-FS, and our proposed method require tuning of a single
parameter, C, controlling the trade-off between a large margin and less constraint
violation. This parameter is tuned using 4-fold cross validation on training data.
The parameter sweeping is done exactly in the same fashion for all algorithms.
Dataset   Measure    BoW     SVM     SVM-FS   Ours
Faces     Acc. (%)   80.11   82.97   86.79    90.0
Faces     ROC Area   n/a     0.90    0.94     0.96
Cars      Acc. (%)   77.5    80.75   81.44    84.0
Cars      ROC Area   n/a     0.86    0.88     0.90

Table 1: Comparison results on the CMU Face and car datasets. BoW: bag of
words approach [22]. SVM: SVM using global statistics. SVM-FS [15] requires
bounding boxes of foreground objects during training. Our method is significantly
better than the others, and it outperforms even the algorithm using strongly labeled
data.


Optimizing (5) is an iterative procedure, where each iteration involves solving a
convex quadratic programming problem. Our implementation uses CVX, a
package for specifying and solving convex programs [14]. We found that our algorithm
generally converges within 100 iterations of coordinate descent.


Figure 3: Localization of (a) sunglasses and (b) cars on test images. Note how
the road below the cars is partially included in the detection output. This indicates
that the appearance of road serves as a contextual indication for the presence of
cars.
Figure 4: Difficult cases for localization. a, b: sunglasses are not clearly visible
in the images; c: the foreground object is very small; d: misdetection due to the
presence of the trailer wheel.

4.1.2  Experiments  on  Caltech-4 

This subsection describes an experiment on the publicly available² Caltech-4
dataset. This collection contains images of different categories:
airplanes_side, background, cars_brad, faces, and motorbikes_side. We consider
binary classification tasks where the goal is to distinguish one of the four
object classes (airplanes_side, cars_brad, faces, and motorbikes_side) from the
background clutter class. In this experiment, we randomly sample a set of 100
images from each class for training. The set of the remaining images is split
into equal-size testing and validation sets. The validation data is used for
parameter tuning.

Tab. 2 shows the results of this experiment. As shown, SVM-FS, a method
that requires bounding boxes of the foreground objects for training, does not
perform as well as SVM, which is based on global statistics from the whole image.
This result suggests that contextual information is very important for
classification tasks on this dataset. Indeed, it is easy to verify by visual
inspection that the image backgrounds here often provide very strong
categorization cues (see e.g. the almost constant background of the face
images). As a result, our method cannot provide any significant advantage on
this dataset. Note, however, that unlike SVM-FS, our joint localization and
classification does not harm the classification performance, as our algorithm
automatically learns the importance of contextual information and uses large
subwindows for recognition.

4.2  Classification  of  time  series  data 

This  section  describes  our  classification  experiments  on  time  series  datasets. 

² http://www.robots.ox.ac.uk/~vgg/data3.html
Class       Measure    BoW     SVM     SVM-FS  Ours
Airplanes   Acc. (%)   89.74   96.05   89.40   96.05
            ROC Area   n/a     0.99    0.95    0.99
Cars        Acc. (%)   94.93   98.17   n/a     98.28
            ROC Area   n/a     1.00    n/a     1.00
Faces       Acc. (%)   59.83   88.70   86.78   89.57
            ROC Area   n/a     0.95    0.91    0.95
Motorbikes  Acc. (%)   76.80   88.99   84.67   87.81
            ROC Area   n/a     0.95    0.92    0.94

Table 2: Results of binary classification between each of the four classes of
Caltech-4 and the background clutter class. BoW: bag of words approach [22],
SVM: traditional SVM using global statistics. SVM-FS [15] is the SVM method
that requires strongly labeled data during training. Results of SVM-FS for the
Cars class are displayed as n/a because ground truth annotation is unavailable.

4.2.1  A  synthetic  example 

The data in this evaluation consists of 800 artificially generated examples of
binary time series (400 positive and 400 negative). Some examples are shown in
Fig. 5. Each positive example contains three long segments of fixed length with
value 1. We refer to these as the foreground segments. Note that the end of a
foreground segment may meet the beginning of another one, thus creating a longer
foreground segment (see e.g. the bottom left signal of Fig. 5). The locations of
the foreground segments are randomly distributed. Each negative example
contains fewer than three foreground segments. Both positive and negative data
are artificially degraded to simulate measurement noise: with a certain
probability, zero energy values are flipped to have value 1. The temporal length
of each signal is 100 and the length of each foreground segment is 10. We split
the data into separate training and testing sets, each containing 400 examples
(200 positive, 200 negative).
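
The generation procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the code used for the experiments: the noise probability (0.05), the uniform choice of 0, 1, or 2 segments for negative examples, and the function names make_series and make_dataset are all assumptions, since the text does not specify them.

```python
import random


def make_series(n_segments, length=100, seg_len=10, noise_prob=0.05, rng=random):
    """Return a binary series with n_segments runs of ones plus spike noise."""
    # Draw the number of zeros preceding each segment; sorting the draws keeps
    # the segments in order, so they never overlap but may touch each other.
    free = length - n_segments * seg_len
    offsets = sorted(rng.randint(0, free) for _ in range(n_segments))
    series = [0] * length
    for idx, offset in enumerate(offsets):
        start = offset + idx * seg_len
        series[start:start + seg_len] = [1] * seg_len
    # Measurement noise: each zero is flipped to one with probability noise_prob
    # (the value of noise_prob is an assumption).
    return [1 if v == 0 and rng.random() < noise_prob else v for v in series]


def make_dataset(n_pos=400, n_neg=400, seed=0):
    """Return (train, test) lists of (series, label) pairs, labels +1 / -1."""
    rng = random.Random(seed)
    pos = [(make_series(3, rng=rng), 1) for _ in range(n_pos)]
    # Assumption: negatives contain 0, 1, or 2 foreground segments, uniformly.
    neg = [(make_series(rng.randint(0, 2), rng=rng), -1) for _ in range(n_neg)]
    # Half of each class goes to training, the other half to testing.
    train = pos[:n_pos // 2] + neg[:n_neg // 2]
    test = pos[n_pos // 2:] + neg[n_neg // 2:]
    return train, test


train, test = make_dataset()
```

Because noise spikes are added to both classes, the total number of ones in a signal is only a weak cue, which is consistent with the poor performance of a classifier based on whole-signal statistics.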

We evaluated the ability of our algorithm to discover automatically the
discriminative segments in these weakly-labeled examples. We trained our
localization-classification SVM by learning k-segmentations for values of k
ranging from 1 to 20. Note that the algorithm has no knowledge of the length or
the type of the pattern distinguishing the two classes. Tab. 3 summarizes the
performance of our approach. Traditional SVM, based on the statistics of the
whole signals, yields


Figure  5:  What  distinguishes  the  time  series  on  the  left  from  the  ones  on  the 
right?  Left:  positive  examples,  each  containing  three  long  segments  with  value  1 
at  random  locations.  Right:  negative  examples,  each  containing  fewer  than  three 
long  segments  with  value  1.  All  signals  are  perturbed  with  measurement  noise 
corresponding  to  spikes  with  value  1  at  random  locations. 


k          1      2      3 to 7  8      12     16     20
Acc. (%)   77.0   93.0   100     98.5   91.5   77.5   67.25
ROC Area   .843   .980   1.00    .998   .933   .793   .613

Table  3:  Classification  performance  on  temporal  data  using  our  approach.  We 
show  the  accuracy  rates  and  the  ROC  areas  obtained  using  different  values  of  k, 
the  number  of  discriminative  time  intervals  used  by  the  algorithm.  Here  traditional 
SVM,  based  on  the  global  statistics  of  the  signals,  yields  an  accuracy  rate  of  66.5% 
and  an  area  under  the  ROC  of  0.577. 
Figure 6: Example frames from the mouse videos.


an accuracy rate of 66.5% and an area under the ROC of 0.577. Thus, our
approach provides much better accuracy than SVM. Note that the performance of
our method is relatively insensitive to the choice of k, the number of
discriminative time-intervals used for classification. It achieves 100% accuracy
when the number of intervals is in the range 3 to 7, and it remains above 90%
for k = 2, 8, and 12, degrading only for much larger values of k. In practice,
one can use cross validation to choose the appropriate number of segments.
Furthermore, Tab. 3 reaffirms the need for multiple intervals: our classifier
built with only one interval achieves an accuracy rate of only 77%.


4.2.2  Mouse  behavior 

We now describe an experiment on mouse behavior recognition performed on a
publicly available dataset³. This collection contains videos corresponding to
five distinct mouse behaviors: drinking, eating, exploring, grooming, and
sleeping. There are seven groups of videos, corresponding to seven distinct
recording sessions. Because of the limited amount of data, performance is
estimated using leave-one-group-out cross validation. This is the same
evaluation methodology used by Dollar et al. [9]. Fig. 6 shows some
representative frames of the clips. Please refer to [9] for further details
about this dataset.
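
As an illustration of the leave-one-group-out protocol, the following Python sketch enumerates the splits: each recording session is held out once for testing while the remaining sessions are used for training. The helper name leave_one_group_out and the toy assignment of 14 clips to 7 sessions are hypothetical and are not taken from the dataset.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group label."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical toy data: 14 clips spread over 7 recording sessions.
groups = [i % 7 for i in range(14)]
splits = list(leave_one_group_out(groups))
```

Splitting by session rather than by clip keeps clips from the same recording session from appearing in both the training and the test set of a given fold.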

We represent each video clip as a set of cuboids [9] which are spatial-temporal
local descriptors. From each video we extract cuboids at interest points computed
using the cuboid detector [9]. To these descriptors we add cuboids computed at
random locations in order to yield a total of 2500 points for each video (this
augmentation of points is done to cancel out effects due to differing sequence
lengths). A library of 50 cuboid prototypes is created by clustering cuboids
sampled from

³ http://vision.ucsd.edu/~pdollar/research/research.html
Action   Dollar et al. [9]  1-NN   SVM    Ours
Drink    0.63               0.58   0.63   0.67
Eat      0.92               0.87   0.91   0.91
Explore  0.80               0.79   0.85   0.85
Groom    0.37               0.23   0.44   0.54
Sleep    0.88               0.95   0.99   0.99

Table 4: F1 scores: detection performance of several algorithms. Higher F1 scores
indicate better performance.

training data using k-means. Subsequently, each cuboid is represented by the ID
of the closest prototype and the frame number at which the cuboid was extracted.
We trained our algorithm with values of k varying from 1 to 3. Here we report the
performance obtained with the best setting for each class.
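
The quantization step can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: descriptors are fixed-length vectors compared with Euclidean distance, and the function names quantize and segment_histogram, as well as the toy 2-D data, are hypothetical. The second function shows how the (prototype ID, frame number) pairs allow a word histogram to be computed over any temporal segment of a clip.

```python
import numpy as np


def quantize(descriptors, prototypes):
    """Return, for each descriptor, the index of the closest prototype."""
    # Pairwise squared Euclidean distances via broadcasting: shape (n, k).
    diffs = descriptors[:, None, :] - prototypes[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def segment_histogram(word_ids, frames, start, end, n_words):
    """Count prototype IDs of cuboids extracted in frames start..end (inclusive)."""
    mask = (frames >= start) & (frames <= end)
    return np.bincount(word_ids[mask], minlength=n_words)


# Toy example with 2-D descriptors and two prototypes (illustrative values only).
prototypes = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [10.0, 11.0]])
frames = np.array([0, 5, 7, 20])

word_ids = quantize(descriptors, prototypes)
hist = segment_histogram(word_ids, frames, start=0, end=10, n_words=2)
```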

A performance comparison is shown in Tab. 4. The second column shows
the result reported by Dollar et al. [9] using a 1-nearest neighbor classifier on
histograms containing only words computed at spatial-temporal interest points.
1-NN is the result obtained with the same method applied to histograms that also
include random points. SVM is the traditional SVM approach in which each video
is represented by the histogram of words over the entire clip. The performance is
measured using the F1 score, which is defined as:

$$F_1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$

Here we use this measure of performance instead of the ROC metric because the
latter is designed for binary classification rather than detection tasks [1]. Our
method achieves the best F1 score, in some cases tied with SVM, on all but one
action (Eat).
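
For concreteness, the F1 score can be computed from detection counts as in the short Python sketch below. The function name f1_score and the example counts are illustrative assumptions and are not taken from the experiments.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from detection counts."""
    if true_positives == 0:
        return 0.0  # convention: no correct detections gives a score of zero
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * recall * precision / (recall + precision)


# Illustrative counts: 8 correct detections, 2 false alarms, 4 misses.
score = f1_score(8, 2, 4)
```

With these counts the precision is 0.8 and the recall is 2/3, so the score equals 16/22, or about 0.727.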


5  Conclusions  and  Future  Work 

This report proposes a novel framework for discriminative localization and
classification from weakly labeled images or time series. We show that the joint
learning of the discriminative regions and of the region-based classifiers leads
to categorization accuracy superior to the performance obtained with supervised
methods relying on costly human ground truth data. In future work we plan to
investigate an unsupervised version of our approach for automatic discovery of
object classes and actions from unlabeled collections of images and videos.
Furthermore, we would like to extend our k-segmentation model to images in order
to improve the recognition of objects having complex shapes.


Appendix  -  Proof  of  global  optimality  of  Algorithm  1 

Algo. 1 is guaranteed to produce a globally optimal solution for (8). Even
stronger, the set $\mathcal{X}_m = \{I_1^m, \dots, I_m^m\}$ produced by the
algorithm is the set of best $m$ intervals that maximize (8). This section
sketches a proof by induction.

+) $m = 1$: this can be easily verified.

+) Suppose $\mathcal{X}_m$ is the set of best $m$ intervals that maximize (8). We
now prove that $\mathcal{X}_{m+1}$ is optimal for $m + 1$ intervals. Assume the
contrary: $\mathcal{X}_{m+1}$ is not optimal for $m + 1$ intervals. Then there
exist disjoint intervals $T_1, \dots, T_{m+1}$ such that:

$$\sum_{i=1}^{m+1} f(T_i) > \sum_{i=1}^{m+1} f(I_i^{m+1}). \tag{11}$$

Because of the way we construct $\mathcal{X}_{m+1}$ from $\mathcal{X}_m$, we have:

$$\sum_{i=1}^{m+1} f(I_i^{m+1}) = \sum_{i=1}^{m} f(I_i^m) + \max\{f(J_1), -f(J_2)\}, \tag{12}$$

$$\text{where } J_1 = \arg\max_J f(J) \ \text{s.t.}\ J \cap I_i^m = \emptyset \ \forall i, \qquad J_2 = \arg\max_J \, -f(J) \ \text{s.t.}\ J \subset I_i^m \text{ for an } i. \tag{13}$$

This, together with (11), leads to:

$$\max\{f(J_1), -f(J_2)\} < \sum_{i=1}^{m+1} f(T_i) - \sum_{i=1}^{m} f(I_i^m). \tag{14}$$

Consider the overlap between $T_1, \dots, T_{m+1}$ and $I_1^m, \dots, I_m^m$.
There are two cases.

• Case 1: $\exists j : T_j \cap I_i^m = \emptyset \ \forall i$. In this case, we have:

$$f(T_j) \le f(J_1) < \sum_{i=1}^{m+1} f(T_i) - \sum_{i=1}^{m} f(I_i^m) \tag{15}$$

$$\Rightarrow \sum_{i=1}^{m} f(I_i^m) < \sum_{i=1,\dots,m+1,\ i \neq j} f(T_i). \tag{16}$$

This contradicts the assumption that $\{I_1^m, \dots, I_m^m\}$ is the set of best
$m$ intervals that maximize (8).

• Case 2: $\forall j, \exists i : T_j \cap I_i^m \neq \emptyset$. Since there are
$m + 1$ $T_j$'s, and there are only $m$ $I_i^m$'s, there must exist one $i$ s.t.
$I_i^m$ intersects with at least two of the $T_j$'s. Suppose $j_1, j_2$ are
indexes s.t. $T_{j_1} \cap I_i^m \neq \emptyset$ and
$T_{j_2} \cap I_i^m \neq \emptyset$. Furthermore, suppose $T_{j_1}, T_{j_2}$ are
consecutive intervals of the $T_j$'s ($T_{j_1}$ precedes $T_{j_2}$ and there is
no $T_j$ in between). Let $T_{j_1} = [t_1^-, t_1^+]$ and
$T_{j_2} = [t_2^-, t_2^+]$. Consider the interval
$T = [t_1^+ + 1, t_2^- - 1]$. Because $T_{j_1} \cap I_i^m \neq \emptyset$ and
$T_{j_2} \cap I_i^m \neq \emptyset$, $T$ must be a subinterval of $I_i^m$, i.e.
$T \subset I_i^m$. Hence

$$-f(T) \le -f(J_2) < \sum_{i=1}^{m+1} f(T_i) - \sum_{i=1}^{m} f(I_i^m) \tag{17}$$

$$\Rightarrow \sum_{i=1}^{m} f(I_i^m) < \sum_{i=1}^{m+1} f(T_i) + f(T) \tag{18}$$

$$\Rightarrow \sum_{i=1}^{m} f(I_i^m) < f(\underbrace{T_{j_1} \cup T \cup T_{j_2}}_{\text{an interval}}) + \sum_{i \neq j_1, j_2} f(T_i). \tag{19}$$

This contradicts the assumption that $\{I_1^m, \dots, I_m^m\}$ is the best set of
$m$ intervals that maximize (8).

Since both cases lead to a contradiction, $\mathcal{X}_{m+1}$ must be the best
set of $m + 1$ intervals that maximize (8). This completes the proof. □
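
The construction step that the proof relies on (extend the current set either by adding the best interval disjoint from all current intervals, called J1 above, or by cutting the best sub-interval J2 out of a current interval) can be sketched in Python as follows. This is a brute-force illustration rather than the report's Algorithm 1, and it rests on several assumptions: f(I) is taken to be the sum of per-time-step scores w[t] over the closed interval I, J2 is required to lie strictly inside its host interval so that the cut leaves two non-empty pieces, and ties are broken by enumeration order. The name greedy_intervals is hypothetical.

```python
from itertools import accumulate


def greedy_intervals(w, k):
    """Greedily build k disjoint closed intervals (s, e) over the scores w."""
    n = len(w)
    prefix = [0] + list(accumulate(w))

    def f(s, e):
        # Assumed interval score: sum of w[s..e], via prefix sums.
        return prefix[e + 1] - prefix[s]

    chosen = []
    for _ in range(k):
        covered = [False] * n
        for s, e in chosen:
            for t in range(s, e + 1):
                covered[t] = True
        best_gain, best_move = None, None
        # Option 1: add the best interval disjoint from all chosen ones (J1).
        for s in range(n):
            if covered[s]:
                continue
            for e in range(s, n):
                if covered[e]:
                    break
                if best_gain is None or f(s, e) > best_gain:
                    best_gain, best_move = f(s, e), ("add", s, e, None)
        # Option 2: cut a strictly interior sub-interval out of a chosen one (J2).
        for host in chosen:
            for s in range(host[0] + 1, host[1]):
                for e in range(s, host[1]):
                    if best_gain is None or -f(s, e) > best_gain:
                        best_gain, best_move = -f(s, e), ("split", s, e, host)
        if best_move is None:
            raise ValueError("no room for another interval")
        kind, s, e, host = best_move
        if kind == "add":
            chosen.append((s, e))
        else:
            chosen.remove(host)
            chosen += [(host[0], s - 1), (e + 1, host[1])]
        chosen.sort()
    return chosen


scores = [2, -1, 2, -5, 4]
three = greedy_intervals(scores, 3)
```

On this small example the first step selects the single best interval (4, 4), the second adds (0, 2), and the third cuts the negative score at index 1 out of (0, 2), which gives (0, 0), (2, 2), and (4, 4).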